Abstract

The smart electricity grid enables a two-way flow of power and data between suppliers and consumers in order to 
facilitate the power flow optimization in terms of economic efficiency, reliability and sustainability. This infrastructure 
permits the consumers and the micro-energy producers to take a more active role in the electricity market and the 
dynamic energy management (DEM). The most important challenge in a smart grid (SG) is how to take advantage of 
the users’ participation in order to reduce the cost of power. However, effective DEM depends critically on load and 
renewable production forecasting. This calls for intelligent methods and solutions for the real-time exploitation of the 
large volumes of data generated by a vast number of smart meters. Hence, robust data analytics, high performance 
computing, efficient data network management, and cloud computing techniques are critical towards the optimized 
operation of SGs. This research aims to highlight the big data issues and challenges faced by the DEM employed in SG 
networks. It also provides a brief description of the most commonly used data processing methods in the literature, and 
proposes a promising direction for future research in the field. 

Keywords: Big data, smart grids, dynamic energy management, predictive analytics, artificial intelligence, high 
performance computing. 


1. Introduction 

A smart grid (SG) is the next-generation power system
able to manage electricity demand in a sustainable, reliable
and economic manner, by employing advanced digital
information and communication technologies. This new
platform aims to achieve steady availability of power,
energy sustainability, environmental protection, prevention
of large-scale failures, as well as optimized operational
expenses (OPEX) of power production and distribution, and
reduced future capital expenses (CAPEX) for thermal
generators and transmission networks [1]. The upcoming
technology in the framework of SG facilitates the development
and efficient interactive utilization of millions of
alternative distributed energy resources (DER) and electric
vehicles [?]. To this end, each consumer location has to be
equipped with a smart meter for monitoring and measuring
the bi-directional flow of power and data, while supervisory
control and data acquisition (SCADA) systems are
needed to control the grid operation.

While dynamic energy management (DEM) in conventional
electricity grids is a well-investigated topic, this is
not the case for SGs. This is due to the much more
complicated nature of SGs, since complex decision-making
processes are required by the control centers [?]. Energy
management systems (EMSs) in SGs include i) real-time wide-area
situational awareness (WASA) of grid status through
advanced metering and monitoring systems, ii) consumers’
participation through home EMSs (HEMS), demand
response (DR) algorithms, and vehicle-to-grid (V2G)
technology, and iii) supervisory control through computer-based
systems [5]. A typical overview of the SG and the
included systems and technologies is given in Fig. 1. The
quality and reliability of the collected data is a key factor
for the optimized operation of the SG, thus rendering data
mining and predictive analytics tools essential for the
effective management and utilization of the available sensor
data [7]. This is because effective DEM relies heavily
on short-term power supply and consumption forecasting,
which handles prediction horizons from one hour
up to one week [8]. Additionally, the sensor data contains
important correlations, trends, and patterns that need to
be exploited for the optimization of the energy consumption
and the DR, among others [1]. Most of the research
related to data mining in SGs deals with predictive
analytics and load classification (LC), which are necessary for
load forecasting, bad data correction, determination of
the optimal energy resources scheduling, and setting of the
power prices [?]. The efficient processing of the vast
amount of produced data requires increased data storage
and computing resources, which implies the need for high
performance computing (HPC) techniques.

Figure 1: Smart Grid overview [6].

This work differs from other related surveys in the
literature, such as [?], in being the first meta-analytic
review on efficient SG data processing with focus on DEM.
Besides, it gives useful insights into technologies and
methods from the area of big data analytics (BDA) that have to
be further explored within the framework of DEM, demand
forecasting, and dynamic pricing. Our investigations
reveal that there is room for research in the following
topics:

• Design and development of algorithms that can
accurately extract the load patterns from large-scale
datasets.

• Design of machine learning (ML)-based algorithms
with improved forecasting performance, low memory
requirements, and scalable architecture.

• Development of novel data-aware resource management
systems that can provide powerful data processing
in distributed computing systems and clusters for
real-time processing.

It is also highlighted that scalability and flexibility, 
achieved through the construction of robust algorithms 
and fast provisioning of HPC resources, can enable the 
efficient processing of the large data volumes involved in 
DEM and short-term power demand/supply forecasting. 

The rest of the paper is organized as follows. Section 2
gives insight into the reasons why conventional data
processing techniques are not appropriate for DEM in SGs.
Section 3 focuses on smart meter data stream mining and
presents the most commonly used methods. Section 4 is
dedicated to the appropriate HPC techniques. Section 5
provides promising future research directions, while
Section 6 concludes the paper.


2. Dynamic Energy Management in SGs: A Big 
Data Issue 

DEM requires power flow optimization, system monitoring,
real-time operation, and production planning [?].
In more detail, DEM in a SG is a complicated, multi-variable
procedure, since the latter enables an interconnected
power distribution network by allowing a two-way
flow of both power and data. This is in contrast to the
traditional power grid, in which the electricity is generated
at a central source and then distributed to consumers.
Thanks to the bi-directional flow of information and power
between suppliers and consumers, the grids become more
adaptive to the increased penetration of DER, also encouraging
users’ participation in energy savings and cooperation
through the DR mechanism [?].

DR can be applied to both residential (e.g., cooling,
heating, electric vehicles (EVs) charging, etc.) and
industrial loads and includes three different concepts: i) energy
consumption reduction, ii) energy consumption (or
production) shifting to periods of low (or high) demand, and
iii) efficient utilization of storage systems [50]. It should be
noted here that plug-in EVs can be considered as storage
devices, while the careful scheduling of their charging and
discharging can benefit both their owners and the
utilities. Obviously, this further increases the number of
parameters that the DEM algorithms have to take into account,
such as the EVs’ charging profiles. Consequently, the associated
complexity is also increased, creating at the same time
storage capacity prediction problems [?]. Thus, a crucial
issue in SGs is how to manage DR in order to reduce peak
electricity load, utilizing at the same time renewable
energies and storage systems more efficiently. Finally, the
effectiveness of DR algorithms depends critically on demand,
price, load, and renewable energy forecasting, which
highlights the need for sophisticated signal processing
techniques [55].

The electricity demand and renewable production in the
SG environment are affected by several factors, including
weather conditions, micro-climatic variations, time of day,
random disturbances, electricity prices, DR, renewable
energy sources, storage cells, micro-grids, and the
development of EVs [23], [26]. High forecasting accuracy
facilitates generation and transmission planning, i.e.,
deciding which power plants to operate and how much power
should be generated by them at a specific time-period,
with the aim to reduce the operating cost and increase the
reliability [?]. It also enables the utilities to successfully
estimate the electricity cost and correctly set the electricity
prices, capturing the interdependency between the energy
demand and the prices [28]. A typical example of this
interdependency is load-synchronization, where a large
portion of load is shifted from hours of high prices to hours
of low prices, without significantly reducing the
peak-to-average ratio [23]. Moreover, insufficient monitoring and
control of the power flow can increase the possibility of
failure (e.g., due to load synchronization, overloading,
congestion, etc.). The power grid, which consists of multiple
components such as relays, switches, transformers, and
substations, must be carefully monitored. Therefore, the
SG requires intelligent real-time monitoring techniques in
order to be capable of detecting abnormal events, finding
their location and causes, and most importantly predicting
and eliminating faults before they happen. This
self-healing behaviour renders the power grid a real “immune
system”, which is one of the most important characteristics
of a SG framework, targeting uninterrupted power
supply [?], [30].
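
The load-synchronization effect mentioned above can be made concrete with a small numerical sketch. The six-hour horizon, the load and price values, and the function names are hypothetical and serve only to illustrate the arithmetic: when all flexible load follows the price signal in an uncoordinated way, the peak moves to the cheapest hour instead of disappearing.

```python
def peak_to_average(load):
    """Peak-to-average ratio (PAR) of a load profile."""
    return max(load) / (sum(load) / len(load))


def shift_to_cheapest(base_load, flexible_load, prices):
    """Move all flexible energy into the single cheapest hour.

    This mimics users who react to prices without any coordination.
    """
    cheapest = min(range(len(prices)), key=lambda h: prices[h])
    shifted = list(base_load)
    shifted[cheapest] += sum(flexible_load)
    return shifted


# Hypothetical 6-hour horizon (kW): inflexible base load plus flexible load.
base = [2.0, 2.0, 4.0, 4.0, 2.0, 2.0]
flexible = [0.0, 0.0, 3.0, 3.0, 0.0, 0.0]      # originally drawn at peak hours
prices = [0.10, 0.05, 0.30, 0.30, 0.10, 0.10]  # hour 1 is the cheapest

original = [b + f for b, f in zip(base, flexible)]
synchronized = shift_to_cheapest(base, flexible, prices)

par_before = peak_to_average(original)
par_after = peak_to_average(synchronized)
print(f"PAR before: {par_before:.2f}, after uncoordinated shift: {par_after:.2f}")
```

In this toy case the total energy is unchanged, but the new peak at the cheapest hour is higher than the original one, which is why price forecasting and load forecasting have to be considered jointly.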

In order to deal with the high level of uncertainties in
DEM, the extreme size of data, and the need for real-time
learning/decision making, the SG demands advanced data
analytic techniques, big data management, and powerful
monitoring techniques [?]. Since BDA is one of the
major driving forces behind a SG, various techniques such
as artificial intelligence, distributed computing and HPC,
simulation and modeling, data network management, database
management, data warehousing, and data analytics are to be
used to guarantee smooth running of SGs. The main
challenge of big data approaches in SGs is the selection,
deployment, monitoring, and analysis of aggregated data in
real-time [33]. The role of big data in SG collective awareness,
the self-organization capability of SGs, and the limitation
of service interruptions are thoroughly discussed in [29].
Specifically, it is explained that the reliability of the
electricity grid could be enhanced if the users were aware of
the effects of their personal energy use on the total
consumption and overloading.

Considering the above, BDA can provide efficient
solutions to specific problems related to data processing in
SGs, as described in the sequel and briefly summarized in
Fig. 2, which illustrates the roadmap to the DEM in SGs.

3. Data Mining and Predictive Analytics in SGs 

Data mining is the standard process to harvest useful
information from a stream of data, such as users’
consumption, injection of renewable power, and EVs’ state of
battery, and transform it into an understandable structure
for further use. The data mining process involves the
utilization of algorithms for discovering patterns among the
data following a similarity criterion [33]. Efficient data
mining is crucial towards the optimized operation of the
SG, since it strongly affects the decision-making of power
producers and consumers, and the reliability of the grid.

3.1. Dimensionality Reduction 

Smart meters generate large volumes of data, and thus
acquiring and processing all of them is inefficient, if not
prohibitive, in terms of communication cost, computing
complexity, and data storage resources utilization. For
this purpose, dimensionality reduction has been applied
in [53], in order to provide a reduced version (sketch) of
the meters’ original data via random projection (RP). It is
shown that processing the produced summarized version
of data instead of the original stream of data leads to an
acceptable relative error. The main advantages of RP are
scalability, reduced complexity, and increased execution
speed.
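
As a minimal sketch of how a random-projection sketch of meter data can work, the example below projects a long reading vector onto a small number of random directions and compares distances before and after. It is a generic Gaussian random projection, not the specific scheme of the cited work; the dimensions, seeds, and synthetic readings are assumptions made for illustration.

```python
import math
import random


def random_projection_matrix(k, d, seed=0):
    """k x d matrix with i.i.d. Gaussian entries of variance 1/k."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def sketch(matrix, reading):
    """Project a d-dimensional reading vector down to k dimensions."""
    return [sum(m * x for m, x in zip(row, reading)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical sizes: one week of 15-minute readings reduced to 128 values.
d, k = 672, 128
rng = random.Random(1)
meter_a = [rng.uniform(0.0, 5.0) for _ in range(d)]
meter_b = [rng.uniform(0.0, 5.0) for _ in range(d)]

R = random_projection_matrix(k, d)
sketch_a, sketch_b = sketch(R, meter_a), sketch(R, meter_b)

true_dist = distance(meter_a, meter_b)
approx_dist = distance(sketch_a, sketch_b)
relative_error = abs(approx_dist - true_dist) / true_dist
print(f"relative error of the sketched distance: {relative_error:.3f}")
```

Because the projection is linear, similarity-based processing such as clustering can operate on the short sketches, at the cost of a distance error that shrinks as `k` grows.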

Dimensionality reduction has only been sufficiently
explored in the area of synchrophasor data. Particularly,
online dimensionality reduction has been proposed in [43], in
order to extract correlations between synchrophasor
measurements, such as voltage, current, frequency, etc. The
proposed method can be used as a preprocessing method in
data analysis and storage, when only an approximation
of the initial data is required. Online dimensionality
reduction has also been successfully used for early event
detection in [44].

3.2. Load Classification 

In classification problems, a set of pre-classified data
points is given and the classification algorithm tries to
discover a rule which describes as closely as possible the
observed classification [10]. LC is based on the clustering
process, which is used to discover groups and identify
distributions in the provided data.

For successful LC in SGs, the most widely used models
are Artificial Neural Networks (ANNs). ANNs are
computational models consisting of a large number of simple
interconnected processors, which can be used to
approximate functions that depend on a large number of
inputs when there is no accurate mathematical model
to describe the phenomenon [?]. This can be achieved
by weighting and transforming the input values by a
suitable function with the aid of sequential sets of neurons
(until an output neuron is activated). In [35], ANNs have
been used for the successful classification of consumer load
curves in order to create patterns of consumption and
facilitate the selection of the appropriate demand-side
management (DSM) technique. Self-organizing mapping, also
known as the Kohonen neural network, which is an
unsupervised neural network method, has also been widely
used for LC [?].

Other commonly used algorithms are K-means, which
is based on the Euclidean distance between objects, Fuzzy
c-means, which is a local search fuzzy clustering method,
and the hierarchical clustering method, which is a model that
can be viewed as a dendrogram [5]. Due to the ever-growing
nature of the SG, a scalable approach is needed
for effective data harvesting and utilization. For this
purpose, an effective online clustering method has been proposed
in [48], based on unsupervised learning techniques, which
improves for this case the extended Classifier System for
clustering (XCSc). This XCSc-based method fits well the
dynamic nature of SGs, while it outperforms the offline
strategies in terms of the storage system performance [49].
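
To illustrate the K-means idea on load curves, the following sketch clusters a handful of synthetic daily profiles with the Euclidean distance. The four-slot curves, the farthest-point seeding, and the function names are assumptions made for this example; production LC would use full-resolution curves and a tested library implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two load curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance and farthest-point seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(list(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centroids))))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each load curve.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 4-slot daily curves (kW): night, morning, afternoon, evening.
curves = [
    [0.2, 0.5, 0.4, 2.0],   # evening-peak profiles
    [0.3, 0.6, 0.5, 2.2],
    [0.2, 0.4, 0.3, 1.9],
    [0.1, 2.1, 2.0, 0.3],   # daytime-peak profiles
    [0.2, 2.3, 1.9, 0.2],
    [0.1, 2.0, 2.2, 0.4],
]
labels, centroids = kmeans(curves, k=2)
print(labels)
```

Each resulting centroid can be read as the characteristic load pattern of its group.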

3.3. Short-term Forecasting 

Short-term load forecasting (STLF) has been widely 
used over the last 30 years and several models have been 
Concern: DR is a huge challenge in practice, since each user reacts
differently to the real-time prices and to stochastic parameters
such as the weather [10]. Also, many of the involved factors are
interdependent [28].
Solution: Before optimizing the power flow and setting the prices, LC
can be used in order to categorize the various load patterns into
a specific number of groups, each of which is characterized by its
own load pattern [10]. Predictive analytics and ML techniques can
facilitate real-time decision making [32].

Concern: The available communication resources, e.g., bandwidth, are
not enough for data acquisition in a centralized manner [34].
Solution: Distributed data mining and dimensionality reduction
reduce the required resources [4], [34], [35].

Concern: Traditional computing techniques cannot handle the fast data
processing that is required for real-time monitoring, dynamic
energy management and power flow optimization [36].
Solution: Efficient high performance computing techniques, such as
virtualization and in-memory computing, can significantly
reduce the computing time [?], [37].

Concern: Increased data storage and computing resources are required
in order to meet the needs of processing the huge amount of
data involved in DEM, which could possibly lead to increased
OPEX and CAPEX [32].
Solution: Cloud computing: pay-per-use [38].

Concern: Security [39], [?].
Solution: Data anonymization via data aggregation, encryption,
etc. [41].

Figure 2: A roadmap of big data analytics in DEM



developed, which can be summarized as follows: i) regression
models, ii) linear time-series-based models, iii) state-space
models, and iv) nonlinear time-series modeling [?]. Moreover,
very little progress has been made in the field of
very-short-term (VST) and ultra-short-term (UST) load
forecasting in SGs, which, among others, are necessary for
the successful self-healing of the SG, since they are
appropriate for load forecasting over a few minutes [?], [52].
Among the appropriate methods for VST and UST, only basic ANNs
have been widely tested. Generally, most of the research
has been focused on large aggregated load data, where
most individual variations are averaged out by the effect
of the law of large numbers. These methods were seldom
tried on individual meters or at meter aggregate levels,
such as distribution feeders and substations [8].
Additionally, as shown in [8], their performance degrades when the
number of the considered meters goes down. For this
purpose, a short-term load forecasting method based on Empirical
Mode Decomposition, Extended Kalman Filter and Extreme
Learning with Kernel is proposed in [53], which is
more appropriate for load forecasting in micro-grids.
Kernel methods have attracted research interest in the area
of STLF , since they increase the computing energy
efficiency [?], [59]. Most works in the existing literature
ignore the interdependency between the demand and the
electricity prices, while assuming that price setting
follows the load forecasting. To this end, the authors in [28]
propose a multi-input multi-output forecasting engine for
joint price and demand prediction using data association
mining algorithms. For effective price forecasting, the
users’ comfort also has to be taken into account, which
can be achieved via supervised machine learning algorithms [60].
For very short term wind power generation forecasting, there
is a myriad of approaches, which can be divided into two
big categories: i) forecasting the wind speed and direction for a
specific wind farm, and ii) forecasting the generated
power in a single step [61]. More interesting details on
this issue can be found in [61]. Finally, short-term
photovoltaic power prediction is mainly based on the past power
output [?].
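
As an illustration of the simplest model family listed above (regression on past values), the sketch below fits a lagged linear model by ordinary least squares and iterates it to produce a multi-step forecast. The model form `y[t] = a + b * y[t - lag]`, the synthetic series, and the function names are assumptions chosen for clarity, not a method taken from the cited works.

```python
def fit_lagged_regression(series, lag):
    """Least-squares fit of y[t] = a + b * y[t - lag]."""
    x = series[:-lag]
    y = series[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def forecast(series, lag, steps, a, b):
    """Iterated multi-step forecast: predictions are fed back as inputs."""
    history = list(series)
    for _ in range(steps):
        history.append(a + b * history[-lag])
    return history[len(series):]


# Synthetic load (kW) with a 4-slot daily pattern that obeys the model exactly.
load = [1.0, 2.0, 4.0, 3.0]
for _ in range(20):
    load.append(0.5 + 0.9 * load[-4])

a, b = fit_lagged_regression(load, lag=4)
next_day = forecast(load, lag=4, steps=4, a=a, b=b)
print(f"a={a:.3f}, b={b:.3f}, next day={[round(v, 3) for v in next_day]}")
```

On real meter data the lagged value would be only one of several regressors, alongside weather, calendar, and price features.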

3.4. Distributed Data Mining 

The traditional centralized frameworks for acquiring,
analyzing and processing data require a huge exchange of
information between the remote sensors (e.g., the smart
meters) and the centralized processor, which is inefficient
in terms of telecommunication resources management and
economic cost. To this end, the authors in [1] present
several distributed data analysis techniques that can be
successfully used for energy demand prediction. The provided
analysis emphasizes the problem of multivariate regression
and rank ordering in a distributed scenario, based on
polynomially bounded computations per node. Decentralized
data mining algorithms have the advantage of scalability,
while they are less affected by peer failures and they
need little computing and communication resources [55].
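
One simple way to see why regression lends itself to a distributed setting is that a least-squares fit depends on the data only through a few sums. The sketch below is a univariate simplification, not the algorithm of the cited work: every node shares five numbers instead of its raw readings, and the combined fit matches the centralized one. The node data and function names are hypothetical.

```python
def local_statistics(x, y):
    """Sufficient statistics a node shares instead of its raw readings."""
    return (len(x), sum(x), sum(y),
            sum(xi * xi for xi in x),
            sum(xi * yi for xi, yi in zip(x, y)))


def combine(stats_list):
    """Element-wise sum of the per-node statistics."""
    return tuple(sum(values) for values in zip(*stats_list))


def solve(stats):
    """Closed-form simple linear regression y = a + b * x."""
    n, sx, sy, sxx, sxy = stats
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


# Hypothetical per-node data: outdoor temperature (deg C) vs. demand (kW).
node_data = [
    ([10.0, 12.0, 15.0], [5.0, 5.4, 6.0]),
    ([18.0, 20.0], [6.6, 7.0]),
    ([25.0, 30.0, 35.0], [8.0, 9.0, 10.0]),
]
distributed_fit = solve(combine([local_statistics(x, y) for x, y in node_data]))

# Reference: the same fit computed with all raw data in one place.
all_x = [v for x, _ in node_data for v in x]
all_y = [v for _, y in node_data for v in y]
centralized_fit = solve(local_statistics(all_x, all_y))
print(distributed_fit, centralized_fit)
```

The communication cost per node is constant in the number of local readings, which is the property that makes such schemes attractive for large meter populations.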

4. High Performance Computing 

Real-time monitoring, DEM, and power flow optimization
are all based on fast data processing and BDA, which
need high computing power. Efficient data mining
algorithms based on task parallelism, using multi-core,
cluster, and grid computing, can reduce the computational
time [36]. However, covering the increased data storage
and computing resources needs is still a big economic
challenge, mainly for the operators of the electricity grids.
Therefore, distributed computing seems to be a promising
perspective [63].

4.1. Dedicated Computational Grid 

In order to enhance the existing computational capabilities
and increase efficiency, a dedicated grid computing
based framework is proposed in [?]. In more detail, an
architecture of three layers is proposed, namely i) the
resource layer, which consists of the hardware part of the
computing grid, ii) the grid middleware, which provides
access of grid resources to the grid services, and iii) the
application layer, which consists of the services. It is shown
that this computational grid can provide HPC by combining
the processing power, memory and storage of the
available computers.

4.2. Cloud Data 

The cloud computing (CC) model meets the requirements
of data and computing intensive SG applications [38], [64].
The main advantages of CC over traditional
models are energy saving, cost saving, agility, scalability,
and flexibility, since computational resources are used on
demand [?]. Many approaches have been developed so
far to further increase the energy efficiency of HPC data
centers, such as energy conscious scheduling in [66], the
cooperation with the SG in [?], and thermal-aware task
scheduling in [?]. In [35], a model for SG data management
is presented, taking advantage of the main characteristics
of CC, such as distributed data management,
parallelization, fast retrieval of information, accessibility,
interoperability and extensibility. Most SG applications,
such as advanced metering infrastructure,
SCADA, and energy management, can be facilitated
by the available cloud service models, namely software as
a service, platform as a service, and infrastructure as a
service [68].

4.2.1. Security 

Confidentiality and privacy are big challenges towards
the application of CC on SG data processing [33], [?]. To
this end, the designed data architectures must be
multi-tenant, following one of the three different approaches for
such architectures, namely separate databases, separate
schemas, or shared schemas [35]. Also, privacy of end
users and data anonymization can be guaranteed by data
aggregation, which is used in most SG architectures.
However, from the electrical companies’ perspective, security
is still a challenging problem, since the hackers of systems
located in the cloud cannot be easily traced. Authentication,
encryption, trust management, and intrusion detection
are important security mechanisms that can prevent,
detect and mitigate such network attacks [6], [?]. Finally,
a related major issue is the recovery of data in the case of
a possible failure of the cloud service [55].
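
The idea that aggregation can hide individual readings can be sketched with pairwise additive masking: every pair of meters shares a random value that one adds and the other subtracts, so the masks cancel in the total. This is a toy illustration under strong assumptions, not a scheme described in the cited works; in particular, the seeded `random` generator stands in for a proper cryptographic key agreement, and `MODULUS` and the readings are hypothetical.

```python
import random

MODULUS = 2 ** 32  # hypothetical ring size for the masked arithmetic


def pairwise_masks(n_meters, seed=0):
    """One mask per meter; the masks cancel when all meters are summed."""
    rng = random.Random(seed)  # illustration only, not cryptographically secure
    masks = [0] * n_meters
    for i in range(n_meters):
        for j in range(i + 1, n_meters):
            shared = rng.randrange(MODULUS)  # secret shared by meters i and j
            masks[i] = (masks[i] + shared) % MODULUS
            masks[j] = (masks[j] - shared) % MODULUS
    return masks


readings_wh = [412, 1280, 97, 655]  # hypothetical per-meter readings (Wh)
masks = pairwise_masks(len(readings_wh))
masked = [(r + m) % MODULUS for r, m in zip(readings_wh, masks)]

# The aggregator only receives `masked`, yet recovers the exact total.
total = sum(masked) % MODULUS
print(total)
```

The aggregator learns the neighbourhood total needed for DEM, while a single masked value reveals nothing useful about the corresponding household on its own.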

5. Future Research Directions & Discussion 

The load data in the SG environment is massive, dynamic,
high-dimensional, and heterogeneous [3]. Thus, in order
to build an accurate real-time monitoring and forecasting
system, two novel concepts have to be taken into account
in the system design. First, as shown in Fig. 3,
all available information from different sources, such as
individual smart meters, energy consumption schedulers,
aggregators, solar radiation sensors, wind-speed meters
and relays, has to be integrated, while a communication
point has to be designed where multiple artificial experts
can interact and make decisions on data. The appropriate
forecasting system should rely on effective data
sampling [69], improved categorization of the information and
successful recognition of the different patterns. Second,
suitable adaptive algorithms and profiles for effective
dynamic, autonomous, distributed, self-organized and fast
multi-node decision-making have to be designed. It has
been shown that the performance of multi-node load
forecasting is clearly better than that of single-node
forecasting [?]. The designed algorithms should be based on
realistic consensus functions or scoring/voting models [71],
where the large computations can be parallelized. The
algorithmic results are the state estimation, the estimated
production and consumption, and the STLF in SGs. For
the most efficient pattern recognition and state estimation
in the SG environment, the following methodologies and
technologies can be used.

5.1. SG Feature Selection and Extraction 

The factors that affect the load forecasting can be
separated into two categories: a) the traditional factors and
b) the SG factors [27]. The traditional factors include the
weather conditions, time of the day, season of the year, and
random events and disturbances. On the other hand, the
SG factors include the electricity prices, demand
response, distributed energy sources, storage cells and
electric vehicles. As shown in Fig. 3, large volumes of data
from the sensors installed around the SG are collected, and
the features extracted in this phase also need to be refined,
as there is noise and redundancy in the features. If the
input features contain redundant information (e.g., highly
correlated features), ML algorithms in general perform
poorly because of numerical instabilities. Some
regularization techniques can be imposed to solve such problems.
The techniques that can be used to build an optimal subset
for the load prediction problem in SG are greedy hill
climbing [72], minimum-redundancy-maximum-relevance [?],
regularized trees [?], random multinomial logit [?], etc.
In addition to the feature selection methods described
above, there are many other techniques that draw on
disciplines such as artificial intelligence (AI) and ML that can
be used to analyze such selected features.

Figure 3: Proposed SG forecast model.
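
The minimum-redundancy-maximum-relevance idea can be sketched as a greedy loop that rewards correlation with the target and penalizes correlation with already selected features. The variant below uses the absolute Pearson correlation for both terms, which is a simplifying assumption (the original criterion is usually stated with mutual information); the feature names and values are hypothetical standardized signals.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def greedy_mrmr(features, target, k):
    """Greedy selection maximizing relevance minus mean redundancy."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(j):
            relevance = abs(pearson(features[j], target))
            if not selected:
                return relevance
            redundancy = sum(abs(pearson(features[j], features[s]))
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical standardized features; the second one duplicates the first.
temperature = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
temperature_copy = list(temperature)
price = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
load = [2.0 * t + p for t, p in zip(temperature, price)]

selected = greedy_mrmr([temperature, temperature_copy, price], load, k=2)
print(selected)
```

The duplicated feature is as relevant as the original but fully redundant with it, so the less relevant yet independent price feature is picked second.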

5.2. Online Learning 

Online learning is a powerful way of dealing with load
monitoring and prediction in SGs. An online learning
algorithm observes a stream of examples and makes a
prediction for each element in the stream [?]. The algorithm
receives immediate feedback about each prediction and
uses this feedback to improve its accuracy on subsequent
predictions. In contrast to statistical ML, online learning
algorithms do not make stochastic assumptions about
the observed data, and even handle situations where the
data is generated by a malicious adversary. There has
been a recent surge of scientific research on online learning
algorithms, largely due to their broad applicability to
web-scale forecasting problems. In load forecasting in SGs,
the statistical properties of the target variable, which
the model is trying to forecast, change over time in
unforeseen ways. This causes the so-called “concept drift”
problem, because the predictions become less accurate as
time passes. In [?], the employed online learning setup
mitigates this problem. In online learning, soon after the
forecast is made, the true label of the instance is
discovered. This information can then be used to refine the
forecasting hypothesis used by the algorithm. The goal of
the algorithm is to make forecasts that are close to the
true labels.
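
The predict-then-update loop described above can be sketched with a normalized least-mean-squares (NLMS) learner, one of the simplest online algorithms. The choice of NLMS, the step size, and the synthetic stream with an abrupt change halfway are assumptions made for illustration; they are not taken from the cited work.

```python
def nlms_stream(stream, n_features, step=0.5, eps=1e-8):
    """Normalized LMS: predict, observe the true load, then update."""
    weights = [0.0] * n_features
    errors = []
    for features, actual in stream:
        predicted = sum(w * x for w, x in zip(weights, features))
        error = actual - predicted  # feedback arrives after the prediction
        norm = eps + sum(x * x for x in features)
        weights = [w + step * error * x / norm
                   for w, x in zip(weights, features)]
        errors.append(abs(error))
    return weights, errors


# Hypothetical stream: features are [bias, normalized temperature].
# The relation between features and load changes halfway (concept drift).
before_drift = [([1.0, 0.5], 3.0)] * 40
after_drift = [([1.0, 0.5], 7.0)] * 40
weights, errors = nlms_stream(before_drift + after_drift, n_features=2)

print(f"error just before drift: {errors[39]:.2e}")
print(f"error right after drift: {errors[40]:.2f}")
print(f"error at end of stream:  {errors[-1]:.2e}")
```

The error jumps when the underlying relation changes and then decays again, because each new true value immediately corrects the model without any retraining on stored history.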

5.3. Randomized Model Averaging 
ML is concerned with the design and development of
algorithms that allow computers to evolve behaviours based
on empirical data. A major focus of ML research is to
automatically learn to recognize complex patterns, such as
the features of SGs, and make intelligent decisions based
on data. Using multiple predictive models (each
developed using statistics and/or ML) can yield better
predictive performance than could be obtained from any of
the constituent models [78], [?]. In addition, the
proposed scheme should utilize the model averaging
technique [?] in order to improve the stability and accuracy
of ML algorithms by reducing variance [?].
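
The variance-reduction effect of model averaging can be seen in a few lines. The sketch below uses a plain equal-weight average of three forecasts; the demand values, the forecasts, and the model labels in the comments are hypothetical, and randomized or weighted averaging schemes would replace the equal weights.

```python
def mse(predictions, actual):
    """Mean squared error of a forecast."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)


def average_models(all_predictions):
    """Equal-weight average of the forecasts of several models."""
    n_models = len(all_predictions)
    return [sum(step) / n_models for step in zip(*all_predictions)]


# Hypothetical hourly demand (kW) and forecasts from three different models.
actual = [5.0, 6.0, 8.0, 7.0]
model_forecasts = [
    [5.5, 6.4, 7.2, 7.5],   # e.g. a regression model
    [4.4, 5.8, 8.6, 6.3],   # e.g. a neural network
    [5.2, 5.5, 8.3, 7.4],   # e.g. a tree-based model
]
combined = average_models(model_forecasts)

individual_errors = [mse(f, actual) for f in model_forecasts]
combined_error = mse(combined, actual)
mean_individual_error = sum(individual_errors) / len(individual_errors)
print(individual_errors, combined_error)
```

For squared error, the error of the averaged forecast can never exceed the mean error of the individual models, and it is much lower when the models' errors partly cancel, as in this example.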

5.4. MapReduce Parallel Processing 
Load forecasting problems, involving intelligent ML
solutions pertaining to large volumes of generated data,
are a perfect fit for MapReduce deployments [?].
MapReduce is a programming model for processing large
datasets with a parallel, distributed algorithm on a
computing cluster of low cost commodity computers. A
MapReduce application typically consists of two phases
(or operations), “map” and “reduce”, with many tasks in
each phase [?]. A map/reduce task deals with a chunk of
data independently and thus, tasks in a given phase can be
easily parallelized and effectively processed in a large-scale
computing environment (i.e., a cloud platform). However,
the existing resource management models pay little
attention to the data and/or utilize a simple technique, called
MapReduce’s rack-aware task placement. Such a
parallelization model could be reconstructed by explicitly taking
the application characteristics and network topology into
account [?]. For instance, in a tree network topology,
two sub-trees (racks) adjacent to each other will be a
better combination than two dispersed sub-trees. The
presented framework can enable the minimization of data
movement and in turn reduce the occurrence of network
contention, and this will eventually enhance the cloud
system’s efficiency. Two main mechanisms that have to be
taken into account explicitly are a) the data locality-aware
scheduling algorithm, and b) the application-specific
resource allocation mechanisms. Specifically, tasks
requiring common datasets are dispatched to computers
(compute nodes) in close proximity to those datasets. For
most data processing applications, storage capacity, such
as disk and memory, is more important than computing
power (i.e., CPUs).
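
The two phases can be sketched in plain Python without any cluster framework. The example below computes the total consumption per meter; the record layout `(meter_id, hour, kWh)`, the chunking, and the function names are hypothetical, and the explicit `shuffle` step stands in for the grouping that a real MapReduce runtime performs between the phases.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit (meter_id, kWh) pairs from one chunk of raw records."""
    return [(meter_id, kwh) for meter_id, _hour, kwh in chunk]


def shuffle(mapped_chunks):
    """Group intermediate values by key, as the framework would do."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: total consumption per meter."""
    return key, sum(values)


# Hypothetical records (meter_id, hour, kWh), split into two chunks.
chunks = [
    [("m1", 0, 1.0), ("m2", 0, 2.0), ("m1", 1, 1.5)],
    [("m2", 1, 2.5), ("m3", 0, 4.0), ("m1", 2, 0.5)],
]
mapped = [map_phase(chunk) for chunk in chunks]  # independent, parallelizable
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

Each call to `map_phase` touches only its own chunk, which is why placing a map task on the node that already stores the chunk avoids moving the raw data over the network.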

5.5. Available Testbeds and Platforms 

The majority of the available power grid systems focus
on modeling of traditional network components, i.e., the
generation systems, loads, and transmission network. A
different approach is followed in [86], where a distribution
grid testbed has been proposed, which can be used
to test the designs of integrated information management
systems. The purpose of this testbed is to successfully
represent the correlation and interdependency among data
sets, aiming to efficiently monitor the status of the SG and
detect abnormalities. Interestingly enough, extensive sets
of detailed SG trial data, which can be used in order
to test the designed schemes, can be easily acquired, thus
facilitating the research in this area [?]. In [57],
registered users can also access the public model, including key
functions, assumptions, and analytical tools.

Storing and processing the huge amount of data
generated by the smart meters requires improved platforms,
appropriate for big data analytics, such as Hadoop,
Cassandra, and Hive [?]. Hadoop is a promising platform for
the distributed processing of large SG data sets. It is
a collection of open source tools and includes the concept
of MapReduce. The Cassandra database, which supports the
cloud infrastructure, can be used in order to store the large
data sets that are needed for effective DEM. Moreover, the
Hive data warehouse software, which uses a simple SQL-like
language, can be used to query datasets that are stored
in a distributed environment.

6. Conclusions 

In this paper, we have summarized the state-of-the-art
in the exploitation of big data tools for dynamic energy
management in smart grid platforms. We have first
highlighted that, in order to deal with the extreme size of
data, the smart grid requires the adoption of advanced
data analytics, big data management, and powerful
monitoring techniques. Next, we elaborated on the utilization
of the most commonly used smart grid data mining and
predictive analytics methods, focusing on the smart meter
data that are necessary for accurate and efficient
power consumption/supply forecasting. We proceeded with
a brief survey of the works dealing with high performance
computing, insisting on cost efficiency and security issues
in the context of SG control. Finally, we discussed
several interesting techniques and methods that have to be
further explored within the framework of a real-time
monitoring and forecasting system, and we provided promising
directions for future research in the field.